{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading Data From An HTTP Server Tutorial\n",
    "\n",
    "MLDB gives users full control over where and how data is persisted. MLDB handles multiple protocol for URLs (see [Files and URLs](../../../../doc/#builtin/Url.md.html)). In this tutorial, we provide examples to load files via <code> http:// </code> or <code> https://</code> for files accessible on a HTTP server on the public internet or a private intranet.\n",
    "\n",
    "For an example using the <code> file:// </code> for a file inside an MLDB container, see the [Loading Data Tutorial](../../../../ipy/notebooks/_tutorials/_latest/Loading%20Data%20Tutorial.ipynb) for an example. MLDB also supports loading files from Amazon S3 and SFTP servers transparently. See the documentation for [Files and URLs](../../../../doc/#builtin/Url.md.html) for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook cells below use `pymldb`'s `Connection` class to make [REST API](../../../../doc/#builtin/WorkingWithRest.md.html) calls. You can check out the [Using `pymldb` Tutorial](../../../../doc/nblink.html#_tutorials/Using pymldb Tutorial) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pymldb import Connection\n",
    "mldb = Connection()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading data with http:// or https://\n",
    "\n",
    "MLDB makes it very easy to load data from a public web server, since a file location can be specified using a remote URI. To illustrate this, we have chosen to load a file from the [Facebook Social Circles](http://snap.stanford.edu/data/egonets-Facebook.html) dataset, hosted by the [Stanford Network Analysis Project (SNAP)](http://snap.stanford.edu/index.html), who provide many public datasets. \n",
    "\n",
    "We will simply import the file `http://snap.stanford.edu/data/facebook_combined.txt.gz` using the [`import.text` procedure](../../../../doc/#builtin/procedures/importtextprocedure.md.html). Notice that not only is the file hosted on a remote server, but it is also compressed. MLDB will decompress it seamlessly as it's being downloaded. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Response [201]>\n"
     ]
    }
   ],
   "source": [
    "dataUrl = \"http://snap.stanford.edu/data/facebook_combined.txt.gz\"\n",
    "\n",
    "print mldb.put(\"/v1/procedures/import_data\", {\n",
    "    \"type\": \"import.text\",\n",
    "    \"params\": {\n",
    "        \"dataFileUrl\": dataUrl,\n",
    "        \"headers\": [\"node\", \"edge\"],\n",
    "        \"delimiter\": \" \", \n",
    "        \"quoteChar\": \"\",\n",
    "        \"outputDataset\": \"import_URL1\",\n",
    "        \"runOnCreation\": True\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now take a look:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>edge</th>\n",
       "      <th>node</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_rowName</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          edge  node\n",
       "_rowName            \n",
       "1            1     0\n",
       "2            2     0\n",
       "3            3     0\n",
       "4            4     0\n",
       "5            5     0"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mldb.query(\"SELECT * FROM import_URL1 LIMIT 5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Accessing a specific file inside an archive\n",
    "\n",
    "If the targeted file is inside an archive (`.tar` or `.zip`), we can specify the specific file we want to extract, as seen in the example below. Here, we load the `3980.circles` file within the `facebook` folder:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<Response [201]>\n"
     ]
    }
   ],
   "source": [
    "dataUrl = \"http://snap.stanford.edu/data/facebook.tar.gz\"\n",
    "\n",
    "print mldb.put(\"/v1/procedures/import_data\", {\n",
    "    \"type\": \"import.text\",\n",
    "    \"params\": {\n",
    "        \"dataFileUrl\": \"archive+\" + dataUrl + \"#facebook/3980.circles\",\n",
    "        \"headers\": [\"circles\"],\n",
    "        \"delimiter\": \" \", \n",
    "        \"quoteChar\": \"\",\n",
    "        \"outputDataset\": \"import_URL2\",\n",
    "        \"runOnCreation\": True\n",
    "    }\n",
    "})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's query our dataset to see what the data looks like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>circles</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>_rowName</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>circle0\\t3989\\t4009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>circle1\\t4010\\t4037</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>circle2\\t4013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>circle3\\t4024\\t3987\\t4015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>circle4\\t4006</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            circles\n",
       "_rowName                           \n",
       "1               circle0\\t3989\\t4009\n",
       "2               circle1\\t4010\\t4037\n",
       "3                     circle2\\t4013\n",
       "4         circle3\\t4024\\t3987\\t4015\n",
       "5                     circle4\\t4006"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mldb.query(\"SELECT * from import_URL2 LIMIT 5\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step would be to format the data in a way we can easily query it. This is shown in the [Executing JavaScript Code Directly in SQL Queries Using the jseval Function Tutorial](../../../../ipy/notebooks/_tutorials/_latest/Executing%20JavaScript%20Code%20Directly%20in%20SQL%20Queries%20Using%20the%20jseval%20Function%20Tutorial.ipynb), where we structure the data in a nicer way."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "With support for multiple protocol types, MLDB makes it easy to load data that resides anywhere. As seen in this tutorial, you can even pinpoint the exact file to load within an archive's folder structure, allowing flexible data management."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Where to next?\n",
    "\n",
    "Check out the other [Tutorials and Demos](../../../../doc/#builtin/Demos.md.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
